# install.packages(c("ggplot2", "dplyr"))
library(ggplot2)
library(dplyr)
library(tidyr)
The purpose of this notebook is to demonstrate some of what is possible for visualisation of a text. Quantitative analysis is a tool that can help to answer some questions, but it is not always useful and there are many questions it cannot address. I hope to demonstrate below some of the things that can be done, and hopefully it will be more inspiring than intimidating.
First, there must be a corpus or digitized text that can be analysed computationally. For this demonstration, I’ve used a corpus of Shakespeare’s plays and adapted some code from a Kaggle notebook.
# R must be at least 3.3.1 for `tm` and `slam` to work.
# install.packages("tm")
# install.packages("SnowballC")
library(tm)
Before we get into anything fun, we have to see what the corpus looks like; that is, how the data frame is structured. These are the first six lines of the corpus. NB: there is currently a small bug in the software that prevents the data from being shown neatly. It should be fixed soon.
shak<-read.csv("../data/Shakespeare_data.csv",header = TRUE, as.is = TRUE)
#shak<-na.omit(shak)
head(shak)
Dataline Play PlayerLinenumber ActSceneLine Player
1 1 Henry IV NA
2 2 Henry IV NA
3 3 Henry IV NA
4 4 Henry IV 1 1.1.1 KING HENRY IV
5 5 Henry IV 1 1.1.2 KING HENRY IV
6 6 Henry IV 1 1.1.3 KING HENRY IV
PlayerLine
1 ACT I
2 SCENE I. London. The palace.
3 Enter KING HENRY, LORD JOHN OF LANCASTER, the EARL of WESTMORELAND, SIR WALTER BLUNT, and others
4 So shaken as we are, so wan with care,
5 Find we a time for frighted peace to pant,
6 And breathe short-winded accents of new broils
Each column is labeled and the content of the column is consistent for each row (all 111396 of them!). Some of the rows may not be useful. Some contain empty cells (labeled NA). Some contain a lot of information and we might need to do some processing on them before we can use the information quantitatively.
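For instance, one minimal preprocessing step (a sketch, assuming the stage-direction and heading rows are exactly the ones with a missing PlayerLinenumber, as they appear to be in the head() output above) would be to drop those rows before counting anything:

```r
# Keep only spoken lines: stage directions and act/scene headings in this
# corpus have NA in the PlayerLinenumber column
dialogue <- shak[!is.na(shak$PlayerLinenumber), ]
nrow(dialogue)  # fewer rows than the full 111396
```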
The first thing we’ll look at is word frequency, or how often a string (in this case “love”) occurs in the data frame. To do this, we must identify every time the word “love” appears and highlight it in a way so that it can be counted based on different properties of its environment (e.g., by play, by player, by scene, etc).
Here are the first 10 rows of a data frame that contains the number of times “love” appears in each play. It’s been sorted in descending order, but doesn’t contain any other information about where and when the word occurs.
# play level word frequency
plays <- unique(shak$Play)
loveFreq<-numeric()
for (i in 1:length(plays)){
text <- Corpus(VectorSource(paste(shak[shak$Play==plays[i],]$PlayerLine,collapse=" ")))
text <- tm_map(text, removePunctuation)
text <- tm_map(text, PlainTextDocument)
text <- tm_map(text, removeWords, stopwords('english'))
# stemming to merge all "loved", "loving" into one
text <- tm_map(text, stemDocument)
tdm <- TermDocumentMatrix(text)
loveFreq[i]<-as.numeric(slam::row_sums(tdm)["love"])
}
lPlay <- data.frame(plays,loveFreq)
lPlay <- na.omit(lPlay)
# order the plays based on the occurrence of love
lPlay<-lPlay[order(-lPlay$loveFreq),]
head(lPlay,10)
plays loveFreq
35 Two Gentlemen of Verona 188
28 Romeo and Juliet 160
6 As you like it 138
22 A Midsummer nights dream 128
17 Loves Labours Lost 125
23 Much Ado about nothing 122
24 Othello 108
33 Troilus and Cressida 87
27 Richard III 86
11 Hamlet 85
We can also look at which players say “love” the most over the course of their appearances. These are the top 10 players who use the word “love” most often.
# player level word frequency
players <- unique(shak$Player)
loveFreq <- numeric()
for (i in 1:length(players)){
text <- Corpus(VectorSource(paste(shak[shak$Player==players[i],]$PlayerLine,collapse=" ")))
text <- tm_map(text, removePunctuation)
text <- tm_map(text, PlainTextDocument)
text <- tm_map(text, removeWords, stopwords('english'))
text <- tm_map(text,stemDocument)
tdm <- TermDocumentMatrix(text)
loveFreq[i] <- as.numeric(slam::row_sums(tdm)["love"])
}
lPlayer <- data.frame(players,loveFreq)
lPlayer <- na.omit(lPlayer)
#order
lPlayer <- lPlayer[order(-lPlayer$loveFreq),]
head(lPlayer,10)
players loveFreq
904 PROTEUS 59
190 ROSALIND 57
771 ROMEO 56
169 HELENA 46
906 JULIA 41
650 IAGO 40
572 JULIET 37
38 GLOUCESTER 36
517 BIRON 36
639 BENEDICK 36
We can also look at the bottom of the list. These are 10 players who only say “love” once, although there are likely many others who are also tied for last.
tail(lPlayer,10)
players loveFreq
890 CALCHAS 1
899 SIR TOBY BELCH 1
903 FABIAN 1
914 Third Outlaw 1
917 ARCHIDAMUS 1
922 MAMILLIUS 1
925 PAULINA 1
932 PERDITA 1
933 DORCAS 1
934 MOPSA 1
Is this useful to you? Can word frequency by character/player, scene, act, play, or author help to answer any of your research questions?
I think the main way quantitative analysis can be of use to the humanities is by visualising properties of the text that might not be immediately apparent from reading. Word frequency is one of these properties, since we (as humans) don’t typically keep track of how often each character says any given word. If you’re interested in how different characters or different authors make use of certain words or phrases, visualising the distribution of those strings might uncover patterns that are otherwise difficult to find.
For instance, maybe you are curious how the longer and shorter plays compare. Instead of hand-counting each, we can graph and order them. Based on this graph, you don’t need to know exactly how long each is, but you can see that Othello is much longer than Loves Labours Lost, which can inform how you approach the comparison.
shak %>%
group_by(Play) %>%
summarise(n = n()) %>%
ggplot(., aes(x=reorder(Play, n),y=n)) +
geom_bar(stat="identity") +
coord_flip() +
ggtitle("Length of Shakespeare's plays") +
theme(legend.position="none") +
xlab("Play") +
ylab("Number of lines")
Within a single play, maybe we want to know which characters are the chattiest. We can visualise the number of lines of text per character to get a sense of who is dominating the stage.
Obviously, it’s Hamlet.
shak %>%
filter(Play == "Hamlet") %>%
group_by(Player) %>%
summarise(n = n()) %>%
ggplot(., aes(x=reorder(Player, n),y=n)) +
geom_bar(stat="identity") +
coord_flip() +
ggtitle("Speech in Hamlet") +
theme(legend.position="none") +
xlab("Player") +
ylab("Number of lines")
One property of much real-life, natural language data (and many other phenomena in human behaviour) is that the frequencies of different events or items tend to follow a Zipf distribution. This just means that there are a very small number of incredibly frequent things and a very large number of very infrequent things. One property of this distribution is that it looks like a very steep curve when plotted on a linear scale, but more like a straight line when plotted logarithmically.
Since it appears that the number of lines per player in Hamlet follows a Zipf curve, we can easily change the count axis (the bottom of the chart, once it’s flipped) to a logarithmic scale. This means that each unit of distance from the lower left is 10 times the value of the previous unit: the distance from 1 to 10 will appear the same as between 10 and 100, and again between 100 and 1000. This kind of scale deemphasizes the absolute differences in frequency among the most frequent things and helps resolve nuanced differences among the least frequent things.
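A quick way to see what the log scale does, with nothing but base R arithmetic:

```r
# Equal multiplicative steps become equal distances on a log10 scale
log10(c(1, 10, 100, 1000))
# [1] 0 1 2 3  -- each "times ten" step is exactly one unit apart
```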
When we make this change to the visualisation above, suddenly we see a lot of nuance in the “long tail” of the data. The players with the fewest lines don’t all have the same number after all, and this might be useful information about who speaks when.
shak %>%
filter(Play == "Hamlet") %>%
group_by(Player) %>%
summarise(n = n()) %>%
ggplot(., aes(x=reorder(Player, n),y=n)) +
geom_bar(stat="identity") +
coord_flip() +
ggtitle("Speech in Hamlet") +
theme(legend.position="none") +
xlab("Player") +
ylab("Number of lines (logarithmic scale)") +
scale_y_log10()
We can also look across plays for frequency. By comparing how often the word “love” appears in each play, we might be able to group them (perceptually) into plays that are about love and plays that are not. Maybe?
lPlay %>%
ggplot(., aes(x=reorder(plays, loveFreq),y=loveFreq)) +
geom_bar(aes(),stat="identity") +
coord_flip() +
ggtitle("Love in each play") +
# theme(legend.position="none") +
xlab("Play") +
ylab("frequency of the word 'love'") +
theme(legend.position = "none")
One thing that graphs can do very easily is give you a way to identify trends when you sort events (e.g., plays) into multiple different categories. For instance, the frequency graph above is interesting, but there are so many plays and as a non-expert, I can’t tell you what each is about, what style it is written in, or whether I’d expect it to be about “love” or not. So, we can add another dimension of information. In the following graph, each color represents a different category (as determined by Wikipedia’s First Folio page, plus information about the “late romances”). Now, we can see if there are trends for different categories to mention “love” more or less than the others.
lPlayCat <- lPlay
lPlayCat$category <- NA
lPlayCat$category[lPlayCat$plays %in% c(
"A Comedy of Errors", "As you like it", "Alls well that ends well",
"Loves Labours Lost", "Measure for measure", "Merchant of Venice",
"Merry Wives of Windsor", "A Midsummer nights dream",
"Much Ado about nothing", "Taming of the Shrew", "Twelfth Night",
"Two Gentlemen of Verona")] <- "comedy"
lPlayCat$category[lPlayCat$plays %in% c(
"Pericles", "Cymbeline", "A Winters Tale", "The Tempest")] <- "romance"
lPlayCat$category[lPlayCat$plays %in% c(
"King John", "Richard II", "Richard III", "Henry IV", "Henry V",
"Henry VI Part 1", "Henry VI Part 2", "Henry VI Part 3", "Henry VIII",
"Coriolanus", "Julius Caesar", "Antony and Cleopatra", "King Lear",
"macbeth")] <- "history"
lPlayCat$category[lPlayCat$plays %in% c(
"Titus Andronicus", "Romeo and Juliet", "Hamlet", "Troilus and Cressida",
"Othello", "Timon of Athens")] <- "tragedy"
# sort(unique(lPlay$plays))
lPlayCat %>%
ggplot(., aes(x=reorder(plays, loveFreq),y=loveFreq)) +
geom_bar(aes(fill=category),stat="identity") +
coord_flip() +
ggtitle("Love in each play") +
# theme(legend.position="none") +
xlab("Play") +
ylab("frequency of the word 'love'")
It seems to me that comedies and tragedies discuss “love” the most, whereas histories and the late romances discuss it the least. Is this intuitive? Maybe. But there’s a problem. A Comedy of Errors has the fewest mentions of “love”, but it’s also the shortest play, so it has the fewest words overall. What we really want is the frequency of “love” as a proportion of each play’s length, not the raw counts. To do that, we have to add the total length of each play to the data frame.
playLength <- shak %>%
group_by(Play) %>%
summarise(n = n())
lPlayCat$length <- NA
for (i in seq_along(playLength$n)) {
lPlayCat$length[lPlayCat$plays == playLength$Play[i]] <- playLength$n[i]
}
lPlayCat %>%
mutate(proportion = loveFreq/length) %>%
ggplot(., aes(x=reorder(plays, proportion),y=proportion)) +
geom_bar(aes(fill=category),stat="identity") +
coord_flip() +
ggtitle("Love in each play") +
# theme(legend.position="none") +
xlab("Play") +
ylab("proportional frequency of the word 'love'")
Not a whole lot has changed, but I think the distribution of comedies and tragedies is even more pronounced. And, we have more information about A Comedy of Errors, which is still very close to the bottom of the graph. Not every comedy is about love, it seems.
Finally, we can generate these same types of graphs for different subgroups, too. Here’s one example, where we look at the number of lines each player has, focusing only on players with more than 700 lines. We can also check for trends among these top speakers by play category. It seems to me that the histories dominate the top of the list overall, but Hamlet and Iago dominate the scene (so to speak).
shak %>%
# shak itself has no category column, so attach the categories assigned above
left_join(select(lPlayCat, Play = plays, category), by = "Play") %>%
group_by(Play,Player,category) %>%
summarise(n = n()) %>%
filter(n > 700) %>%
ggplot(., aes(x=reorder(Player, n),y=n)) +
geom_bar(aes(fill=category),stat="identity") +
coord_flip() +
ggtitle("Amount of lines by character") +
# theme(legend.position="none") +
xlab("Player") +
ylab("Number of lines")
Is this because histories and tragedies tend to be longer plays, overall? Quite possibly. To dig further, let’s switch to the tidytext approach and split the text into one word per row:
library(dplyr)
#install.packages("tidytext")
library(tidytext)
shak %>%
as_tibble(.) %>%
unnest_tokens(tbl=., input = PlayerLine, output = word)
# A tibble: 820,204 x 6
   Dataline Play     PlayerLinenumber ActSceneLine Player word
      <int> <chr>               <int> <chr>        <chr>  <chr>
 1        1 Henry IV               NA ""           ""     act
 2        1 Henry IV               NA ""           ""     i
 3        2 Henry IV               NA ""           ""     scene
 4        2 Henry IV               NA ""           ""     i
 5        2 Henry IV               NA ""           ""     london
 6        2 Henry IV               NA ""           ""     the
 7        2 Henry IV               NA ""           ""     palace
 8        3 Henry IV               NA ""           ""     enter
 9        3 Henry IV               NA ""           ""     king
10        3 Henry IV               NA ""           ""     henry
# ... with 820,194 more rows
lPlayer %>%
filter(loveFreq > 20) %>%
ggplot(., aes(x=reorder(players, loveFreq),y=loveFreq)) +
geom_bar(aes(fill=players),stat="identity") +
coord_flip() +
ggtitle("'Love' by player") +
xlab("Player") +
ylab("frequency of the word 'love'") +
theme(legend.position = "none")
shak %>%
as_tibble(.) %>%
unnest_tokens(tbl=., input = PlayerLine, output = word) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE)
Joining, by = "word"
# A tibble: 24,148 x 2
   word      n
   <chr> <int>
 1 thou   5193
 2 thy    3727
 3 thee   3024
 4 lord   2621
 5 sir    2454
 6 enter  2338
 7 love   1927
 8 hath   1845
 9 king   1500
10 tis    1384
# ... with 24,138 more rows
shak %>%
as_tibble(.) %>%
unnest_tokens(tbl=., input = PlayerLine, output = word) %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
filter(n>800) %>%
ggplot(., aes(x=reorder(word,n),y=n)) +
geom_bar(stat="identity") +
coord_flip()
Joining, by = "word"
How can we organise this so that we can compare across plays?
shak[,c(2,5,6)] %>%
as_tibble() %>%
unnest_tokens(tbl=., input = PlayerLine, output = word) %>%
filter(word=="love" | word =="king" | word=="death" | word=="sweet") %>%
#add_count(Player) %>%
group_by(Player,Play,word) %>%
summarise(n=n()) %>%
#anti_join(stop_words) %>%
filter( Play == "Hamlet" |
Play == "King Lear" |
Play == "A Midsummer nights dream" |
Play == "Othello" |
Play == "Henry V" |
Play == "Romeo and Juliet") %>%
arrange(desc(n)) %>%
ggplot(., aes(x=word,y=n)) +
geom_bar(aes(fill=word),stat="identity") +
# coord_flip() +
facet_wrap(~Play)
Is there a way to break it down to see who is saying what?
word <- c(NA,"thou","thee","thy","thine","dost","shalt","wilt","hast","hath","scene","tis","ii","iii","iv","v","vi","vii")
lexicon <- rep("shakespeare", length(word))
# build a proper data frame (cbind would give a character matrix, which
# doesn't row-bind cleanly with the stop_words tibble)
new_stop <- data.frame(word, lexicon, stringsAsFactors = FALSE)
shak_stop <- rbind(new_stop, stop_words)
shak %>%
as_tibble() %>%
unnest_tokens(input = PlayerLine, output = bigram, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separates bigram into two columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
count(word1, word2, sort = TRUE)
# A tibble: 57,371 x 3
   word1    word2       n
   <chr>    <chr>   <int>
 1 enter    king      101
 2 mine     eyes       95
 3 king     henry      88
 4 sir      john       80
 5 mark     antony     76
 6 mine     honour     71
 7 king     richard    51
 8 god      save       48
 9 gracious lord       46
10 noble    lord       46
# ... with 57,361 more rows
#install.packages("igraph")
#install.packages("ggraph")
library(igraph)
library(ggraph)
library(grid)
set.seed(814)
a <- grid::arrow(type = "closed", angle=22.5, length = unit(.1, "inches"))
p1 <- shak %>%
filter(Play=="Hamlet") %>%
as_tibble() %>%
unnest_tokens(input = PlayerLine, output = bigram, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separates bigram into two columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
count(word1, word2, sort = TRUE) %>%
filter(n > 6) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), edge_colour="darkblue", show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel=TRUE) + # , vjust = 1, hjust = 1) +
theme_void()
set.seed(814)
p2 <- shak %>%
filter(Play == "Twelfth Night") %>%
as_tibble() %>%
unnest_tokens(input = PlayerLine, output = bigram, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separates bigram into two columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
count(word1, word2, sort = TRUE) %>%
filter(n > 6) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), edge_colour="darkred", show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "salmon", size = 5) +
geom_node_text(aes(label = name), repel=TRUE) + # , vjust = 1, hjust = 1) +
theme_void()
set.seed(814)
p3 <- shak %>%
filter(Play == "Romeo and Juliet") %>%
as_tibble() %>%
unnest_tokens(input = PlayerLine, output = bigram, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separates bigram into two columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
count(word1, word2, sort = TRUE) %>%
filter(n > 6) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), edge_colour="darkgreen", show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "green2", size = 5) +
geom_node_text(aes(label = name), repel=TRUE) + # , vjust = 1, hjust = 1) +
theme_void()
# multiplot() is not in base R: it is the helper from the "Cookbook for R"
# (also available as Rmisc::multiplot)
multiplot(p1,p2,p3,cols=3)
set.seed(814)
a <- grid::arrow(type = "closed", angle=22.5, length = unit(.1, "inches"))
p1 <- shak %>%
filter(ActSceneLine != "") %>%
filter(Play=="Hamlet") %>%
as_tibble() %>%
unnest_tokens(input = PlayerLine, output = bigram, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separates bigram into two columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
count(word1, word2, sort = TRUE) %>%
filter(n > 3) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), edge_colour="darkblue", show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel=TRUE) + # , vjust = 1, hjust = 1) +
theme_void() +
ggtitle("Hamlet")
set.seed(814)
p2 <- shak %>%
filter(ActSceneLine != "") %>%
filter(Play == "Twelfth Night") %>%
as_tibble() %>%
unnest_tokens(input = PlayerLine, output = bigram, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separates bigram into two columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
count(word1, word2, sort = TRUE) %>%
filter(n > 3) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), edge_colour="darkred", show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "salmon", size = 5) +
geom_node_text(aes(label = name), repel=TRUE) + # , vjust = 1, hjust = 1) +
theme_void() +
ggtitle("Twelfth Night")
set.seed(814)
p3 <- shak %>%
filter(ActSceneLine != "") %>%
filter(Play == "Romeo and Juliet") %>%
as_tibble() %>%
unnest_tokens(input = PlayerLine, output = bigram, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separates bigram into two columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
count(word1, word2, sort = TRUE) %>%
filter(n > 3) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), edge_colour="darkgreen", show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "green2", size = 5) +
geom_node_text(aes(label = name), repel=TRUE) + # , vjust = 1, hjust = 1) +
theme_void() +
ggtitle("Romeo and Juliet")
set.seed(814)
p4 <- shak %>%
filter(ActSceneLine != "") %>%
filter(Play == "Othello") %>%
as_tibble() %>%
unnest_tokens(input = PlayerLine, output = bigram, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separates bigram into two columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
count(word1, word2, sort = TRUE) %>%
filter(n > 3) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), edge_colour="darkorange", show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "orange", size = 5) +
geom_node_text(aes(label = name), repel=TRUE) + # , vjust = 1, hjust = 1) +
theme_void() +
ggtitle("Othello")
set.seed(814)
p5 <- shak %>%
filter(ActSceneLine != "") %>%
filter(Play == "Henry IV") %>%
as_tibble() %>%
unnest_tokens(input = PlayerLine, output = bigram, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separates bigram into two columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
count(word1, word2, sort = TRUE) %>%
filter(n > 3) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), edge_colour="cadetblue4", show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "cyan", size = 5) +
geom_node_text(aes(label = name), repel=TRUE) + # , vjust = 1, hjust = 1) +
theme_void() +
ggtitle("Henry IV")
set.seed(814)
p6 <- shak %>%
filter(ActSceneLine != "") %>%
filter(Play == "The Tempest") %>%
as_tibble() %>%
unnest_tokens(input = PlayerLine, output = bigram, token = "ngrams", n = 2) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separates bigram into two columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
count(word1, word2, sort = TRUE) %>%
filter(n > 3) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), edge_colour="violet", show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "magenta", size = 5) +
geom_node_text(aes(label = name), repel=TRUE) + # , vjust = 1, hjust = 1) +
theme_void() +
ggtitle("The Tempest")
multiplot(p1,p2,p3,p4,p5,p6,cols=3)
Moving from bigrams to trigrams, with a lower frequency threshold, should give us a better idea of slightly looser connections.
set.seed(814)
a <- grid::arrow(type = "closed", angle=22.5, length = unit(.1, "inches"))
shak %>%
as_tibble() %>%
filter(ActSceneLine != "") %>%
unnest_tokens(input = PlayerLine, output = trigram, token = "ngrams", n = 3) %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>% # separates the trigram into three columns, one for each word
filter(!word1 %in% shak_stop$word) %>% # filters stop words from first column
filter(!word2 %in% shak_stop$word) %>% # filters stop words from second column
filter(!word3 %in% shak_stop$word) %>% # filters stop words from third column
count(word1, word2, word3, sort = TRUE) %>%
filter(n > 2) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n), edge_colour="darkblue", show.legend = FALSE,
arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color = "lightblue", size = 5) +
geom_node_text(aes(label = name), repel=TRUE) + # , vjust = 1, hjust = 1) +
theme_void()
We can also step back from the words entirely and look at the structure of the plays: splitting the ActSceneLine column into act, scene, and line lets us count how many lines each scene of each play contains.
shak %>%
as_tibble() %>%
#filter(Play == "Hamlet" | Play == "Loves Labours Lost" | Play == "A Winters Tale") %>%
filter(ActSceneLine != "") %>%
mutate(ActSceneLine2 = ActSceneLine) %>%
separate(ActSceneLine2, c("act", "scene", "line")) %>%
count(Play,act,scene, sort=TRUE) %>%
transmute(play=Play, act=as.numeric(act), scene=as.numeric(scene), n=n)
# A tibble: 737 x 4
   play                  act scene     n
   <chr>               <dbl> <dbl> <int>
 1 Loves Labours Lost      5     2   972
 2 A Winters Tale          4     4   929
 3 Hamlet                  2     2   616
 4 King John               2     1   609
 5 The Tempest             1     2   596
 6 Cymbeline               5     5   584
 7 Measure for measure     5     1   580
 8 Timon of Athens         4     3   577
 9 Richard III             4     4   561
10 A Winters Tale          1     2   539
# ... with 727 more rows
What this all seems to tell us is that we can visualise the structure of the plays, separate from their content. Is this useful to you?
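As one last sketch of that idea (assuming the shak data frame and the packages loaded above), we could draw the lines per scene of a single play as a grid of tiles, which makes its shape visible without reading a word of dialogue:

```r
# Lines per scene in Hamlet as a tile map: each cell is one scene,
# shaded by how many lines it contains
shak %>%
  filter(Play == "Hamlet", ActSceneLine != "") %>%
  separate(ActSceneLine, c("act", "scene", "line")) %>%
  count(act, scene) %>%
  ggplot(aes(x = scene, y = act, fill = n)) +
  geom_tile() +
  ggtitle("Lines per scene in Hamlet")
```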